The definition of adequate metrics between the objects to be compared is at the core of many machine learning methods (e.g., nearest neighbors, kernel machines, etc.). When complex objects such as time series are involved, such metrics have to be carefully designed in order to capture the desired notions of similarity.
Let us illustrate our point with an example.
The figure above shows the result of a \(k\)-means clustering that uses the Euclidean distance as its base metric. One issue with this metric is that it is not invariant to time shifts, while the dataset at hand clearly exhibits such shifts.
When using a shift-invariant similarity measure (discussed in our {ref}`sec:dtw` section) at the core of \(k\)-means, one gets:
This part of the course tackles the definition of adequate similarity measures for time series and their use at the core of machine learning methods.
This section covers works related to Dynamic Time Warping for time series.
Dynamic Time Warping (DTW) (Sakoe and Chiba 1978) is a similarity measure between time series. Consider two time series \(\mathbf{x}\) and \(\mathbf{x}^\prime\) of respective lengths \(n\) and \(m\). Here, all elements \(x_i\) and \(x^\prime_j\) are assumed to lie in the same \(p\)-dimensional space and the exact timestamps at which observations occur are disregarded: only their ordering matters.
In the following, a path \(\pi\) of length \(K\) is a sequence of \(K\) index pairs \(\left((i_0, j_0), \dots , (i_{K-1}, j_{K-1})\right)\).
DTW between \(\mathbf{x}\) and \(\mathbf{x}^\prime\) is formulated as the following optimization problem:
\[\begin{equation} DTW_q(\mathbf{x}, \mathbf{x}^\prime) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)} \left( \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^q \right)^{\frac{1}{q}} \label{eq:dtw} \end{equation}\]
where \(\mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)\) is the set of all admissible paths, i.e. the set of paths \(\pi\) such that:
\(\pi\) is a sequence \([\pi_0, \dots , \pi_{K-1}]\) of index pairs \(\pi_k = (i_k, j_k)\) with \(0 \leq i_k < n\) and \(0 \leq j_k < m\)
\(\pi_0 = (0, 0)\) and \(\pi_{K-1} = (n - 1, m - 1)\)
for all \(k > 0\), \(\pi_k = (i_k, j_k)\) is related to \(\pi_{k-1} = (i_{k-1}, j_{k-1})\) as follows: \(\pi_k - \pi_{k-1} \in \{(0, 1), (1, 0), (1, 1)\}\), i.e. each step advances by one in at least one of the two series.
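As a sanity check, these admissibility conditions can be verified programmatically. Below is a minimal sketch (the function `is_admissible` is ours, for illustration only):

```python
def is_admissible(path, n, m):
    """Check that a list of (i, j) index pairs is an admissible DTW path."""
    # Boundary conditions: start at (0, 0), end at (n - 1, m - 1)
    if path[0] != (0, 0) or path[-1] != (n - 1, m - 1):
        return False
    for (i_prev, j_prev), (i, j) in zip(path, path[1:]):
        # Indices must stay within bounds
        if not (0 <= i < n and 0 <= j < m):
            return False
        # Each step advances by one in at least one of the two series
        if (i - i_prev, j - j_prev) not in {(1, 0), (0, 1), (1, 1)}:
            return False
    return True
```

For instance, `is_admissible([(0, 0), (1, 1), (2, 1)], 3, 2)` returns `True`, while a path that skips directly from \((0, 0)\) to \((2, 1)\) is rejected.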
In what follows, we will denote \(DTW_2\) simply as DTW. In this context, a path can be seen as a temporal alignment of the time series, and the optimal path is the one for which the Euclidean distance between the aligned (i.e., resampled) time series is minimal.
The following image exhibits the DTW path (in white) for a given pair of time series, on top of the cross-similarity matrix that stores \(d(x_i, {x}^\prime_j)\) values:
There exists an \(O(mn)\) algorithm to compute the exact optimum for this problem (assuming computation of \(d(\cdot,\cdot)\) is \(O(1)\)):
```python
import numpy as np

def dtw(x, x_prime, q=2):
    n, m = len(x), len(x_prime)
    gamma = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # Local cost d(x_i, x'_j) ** q
            gamma[i, j] = np.linalg.norm(x[i] - x_prime[j]) ** q
            if i > 0 or j > 0:
                gamma[i, j] += min(
                    gamma[i - 1, j] if i > 0 else np.inf,
                    gamma[i, j - 1] if j > 0 else np.inf,
                    gamma[i - 1, j - 1] if (i > 0 and j > 0) else np.inf
                )
    return gamma[-1, -1] ** (1. / q)
```
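To see that this dynamic program indeed solves the optimization problem from the DTW definition, one can, for very short univariate series, enumerate all admissible paths explicitly and take the best one. The sketch below assumes \(d(a, b) = |a - b|\) and is for illustration only (function names are ours; the cost is exponential in the series lengths):

```python
def all_admissible_paths(i, j):
    """All admissible paths from (0, 0) to (i, j), built recursively."""
    if i == 0 and j == 0:
        return [[(0, 0)]]
    paths = []
    # The predecessor of (i, j) is one of its three lower-left neighbors
    for (i_prev, j_prev) in [(i - 1, j), (i, j - 1), (i - 1, j - 1)]:
        if i_prev >= 0 and j_prev >= 0:
            for path in all_admissible_paths(i_prev, j_prev):
                paths.append(path + [(i, j)])
    return paths

def dtw_brute_force(x, x_prime, q=2):
    """Exhaustive version of DTW_q; tiny inputs only."""
    best = min(
        sum(abs(x[i] - x_prime[j]) ** q for (i, j) in path)
        for path in all_admissible_paths(len(x) - 1, len(x_prime) - 1)
    )
    return best ** (1. / q)
```

For instance, `dtw_brute_force([0, 1, 2], [0, 2])` returns 1.0: the optimal path \(((0,0), (1,0), (2,1))\) matches the middle element 1 to the value 0, for a total squared cost of 1.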
The basic idea behind this algorithm is that there exists a recurrence relationship between partial DTW computations. More precisely, if we denote by \(\gamma_{i,j}\) the \(DTW_q\) (at power \(q\)) similarity between sequences \(\mathbf{x}_{\rightarrow i}\) and \(\mathbf{x}^\prime_{\rightarrow j}\) (where the notation \(\mathbf{x}_{\rightarrow i}\) denotes sequence \(\mathbf{x}\) observed up to time \(i\)), then we have:
\[ \begin{aligned} \gamma_{i,j} &= DTW_q(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})^q \\ &= \min_{\pi \in \mathcal{A}(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi} d(x_k, x^\prime_l)^q \\ &\stackrel{*}{=} d(x_i, x^\prime_j)^q + \min_{\pi \in \mathcal{A}(\mathbf{x}_{\rightarrow i}, \mathbf{x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi[:-1]} d(x_k, x^\prime_l)^q \\ &\stackrel{**}{=} d(x_i, x^\prime_j)^q + \min (\gamma_{i-1, j}, \gamma_{i, j-1}, \gamma_{i-1, j-1}) \end{aligned} \]
and \(DTW_q(\mathbf{x}, \mathbf{x}^\prime)\) is then \((\gamma_{n-1, m-1})^{\frac{1}{q}}\). In more detail, equality \((*)\) holds because any admissible path has to end at the cell \((i, j)\), and \((**)\) holds because the rest of the path is itself an optimal admissible path ending at one of the three cells \((i-1, j)\), \((i, j-1)\) or \((i-1, j-1)\).
The dynamic programming algorithm presented above relies on this recurrence formula and stores intermediate computations for efficiency.
A Dynamic Time Warping path can either be represented as a list of index pairs (as suggested earlier) or as a binary matrix whose non-zero entries are those corresponding to a matching between time series elements:
\[\begin{equation} (A_\pi)_{i,j} = \left\{ \begin{array}{rl} 1 & \text{ if } (i, j) \in \pi \\ 0 & \text{ otherwise} \end{array} \right. . \end{equation}\]
This is illustrated in the Figure below where the binary matrix is represented as a grid on which the DTW path \(\pi\) is superimposed, and each dot on the grid corresponds to a non-zero entry in \(A_\pi\):
Using matrix notation, Dynamic Time Warping can be written as:
\[\begin{equation*} DTW_q(\mathbf{x}, \mathbf{x}^\prime) = \min_{\pi \in \mathcal{A}(\mathbf{x}, \mathbf{x}^\prime)} \langle A_\pi, D_q(\mathbf{x}, \mathbf{x}^\prime) \rangle^{\frac{1}{q}} \end{equation*}\]
where \(D_q(\mathbf{x}, \mathbf{x}^\prime)\) is the matrix that stores the distances \(d(x_i, x^\prime_j)\) raised to the power \(q\).
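This matrix formulation is easy to check numerically. Below is a minimal sketch with NumPy (the series and the path are arbitrary, and `path_to_matrix` is ours):

```python
import numpy as np

def path_to_matrix(path, n, m):
    """Binary matrix A_pi with ones at the matched index pairs."""
    A = np.zeros((n, m))
    for (i, j) in path:
        A[i, j] = 1.
    return A

x = np.array([0., 1., 2.])
x_prime = np.array([0., 2.])
# D_q stores d(x_i, x'_j) ** q, here for q = 2
D_q = np.abs(x[:, None] - x_prime[None, :]) ** 2
path = [(0, 0), (1, 0), (2, 1)]
A_pi = path_to_matrix(path, len(x), len(x_prime))
# The Frobenius inner product <A_pi, D_q> is the q-th power of the path cost
cost = np.sum(A_pi * D_q)
```

Here `cost` equals 1.0, the sum of the squared distances along the chosen path.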
Dynamic Time Warping satisfies the following properties:
However, mathematically speaking, DTW is not a valid metric, since it satisfies neither the triangle inequality nor the identity of indiscernibles (e.g., DTW between the series \((0, 0)\) and \((0)\) is 0 although the two series differ).
The set of temporal deformations to which DTW is invariant can be reduced by imposing additional constraints on the set of admissible paths. Such constraints typically force paths to stay close to the diagonal.
The Sakoe-Chiba band is parametrized by a radius \(r\) (the number of off-diagonal elements to consider, sometimes also called the warping window size), as illustrated below:
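For square cost matrices, the band can be sketched as a boolean mask over the \((i, j)\) cells. The following is a simplified illustration assuming series of equal length (`sakoe_chiba_mask` is ours, not a library function):

```python
import numpy as np

def sakoe_chiba_mask(size, radius):
    """Allowed cells: those at most `radius` off the main diagonal."""
    i, j = np.indices((size, size))
    return np.abs(i - j) <= radius
```

For instance, `sakoe_chiba_mask(4, 1)` allows the cell \((0, 1)\) but forbids \((0, 2)\); a DTW variant constrained by this band only considers paths whose cells all lie inside the mask.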
The Itakura parallelogram sets a maximum slope \(s\) for alignment paths, which leads to a parallelogram-shaped constraint:
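In the same spirit, the parallelogram can be sketched as a mask built from the slope constraint, applied both from the start and from the end of the alignment. This is a simplified version for equal-length series (the function is ours; library implementations handle the general case):

```python
import numpy as np

def itakura_mask(size, max_slope):
    """Allowed cells for an Itakura parallelogram with maximum slope `max_slope`."""
    i, j = np.indices((size, size)).astype(float)
    end = size - 1
    # Slope constraint seen from the start cell (0, 0) ...
    from_start = (j <= max_slope * i) & (i <= max_slope * j)
    # ... and from the end cell (size - 1, size - 1)
    from_end = ((end - j) <= max_slope * (end - i)) & \
               ((end - i) <= max_slope * (end - j))
    return from_start & from_end
```

With `itakura_mask(5, 2.)`, the diagonal cells are allowed while cells such as \((0, 1)\) or \((1, 3)\) fall outside the parallelogram.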